The increasing size of datasets in drug discovery makes it challenging to build robust and accurate predictive models within a reasonable amount of time. In order to investigate the effect of dataset sizes on predictive performance and modelling time, ligand-based regression models were trained on open datasets of varying sizes of up to 1.2 million chemical structures. For modelling, two implementations of support vector machines (SVM) were used. Chemical structures were described by the signatures molecular descriptor. Results showed that for the larger datasets, the LIBLINEAR SVM implementation performed on par with the well-established libsvm with a radial basis function kernel, but with dramatically less time for model building even on modest computer resources. Using a non-linear kernel proved to be infeasible for large data sizes, even with substantial computational resources on a computer cluster. To deploy the resulting models, we extended the Bioclipse decision support framework to support models from LIBLINEAR and made our models of logD and solubility available from within Bioclipse.